04. Crawler Output
Step 2. Crawler Output
Now the crawler can load its configuration and run, but it does not know how to tell you about its results. Let's fix that!
You will need to print the results to a JSON file using this format:
Example JSON Output
{
  "wordCounts": {
    "foo": 54,
    "bar": 23,
    "baz": 14
  },
  "urlsVisited": 12
}
wordCounts
- The mapping of popular words. Each key is a word that was encountered during the web crawl, and each value is the total number of times that word was seen. When computing these counts for a given crawl, results from the same page are never counted twice.
The size of the returned map should be the same as the "popularWordCount" option in the crawler configuration. For example, if "popularWordCount" is 3, only the top 3 most frequent words are returned.
The keys and values should be sorted so that the more frequent words come first. If multiple words have the same frequency, longer words rank higher. If multiple words have the same frequency and length, use alphabetical order to break ties (the word that comes first in the alphabet ranks higher). You can use the existing Comparator class in src/main/java/com/udacity/webcrawler/WordCounts.java to do this.
urlsVisited
- The number of distinct URLs the web crawler visited. A URL is considered "visited" if the web crawler attempted to crawl that URL, even if the HTTP request to download the page returned an error.
When computing this value for a given crawl, the same URL is never counted twice.
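The ordering rules for wordCounts (frequency first, then word length, then alphabetical order) can be sketched as a comparator. This is only an illustration: WordOrderSketch, BY_POPULARITY, and topWords are hypothetical names, and the Comparator that already exists in WordCounts.java may be written differently.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; not the project's WordCounts class. */
public final class WordOrderSketch {

  /** Higher count first, then longer word first, then alphabetical order. */
  static final Comparator<Map.Entry<String, Integer>> BY_POPULARITY = (a, b) -> {
    if (!a.getValue().equals(b.getValue())) {
      return Integer.compare(b.getValue(), a.getValue());
    }
    if (a.getKey().length() != b.getKey().length()) {
      return Integer.compare(b.getKey().length(), a.getKey().length());
    }
    return a.getKey().compareTo(b.getKey());
  };

  /** Returns at most popularWordCount entries in ranked order. */
  static Map<String, Integer> topWords(Map<String, Integer> counts, int popularWordCount) {
    // LinkedHashMap keeps insertion order, so the ranking survives serialization.
    Map<String, Integer> top = new LinkedHashMap<>();
    counts.entrySet().stream()
        .sorted(BY_POPULARITY)
        .limit(popularWordCount)
        .forEach(e -> top.put(e.getKey(), e.getValue()));
    return top;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = new HashMap<>();
    counts.put("bar", 23);
    counts.put("foo", 54);
    counts.put("baz", 14);
    counts.put("quux", 14);
    counts.put("abc", 14);
    // "quux" beats "abc" and "baz" on length; "abc" beats "baz" alphabetically.
    System.out.println(topWords(counts, 4)); // prints {foo=54, bar=23, quux=14, abc=14}
  }
}
```

An insertion-ordered map is used for the result because a plain HashMap would lose the ranking when the map is written out as JSON.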
Implementing Crawler Output
Now, it's time to fill in src/main/java/com/udacity/webcrawler/json/CrawlResultWriter.java. This should feel similar to the last step, but this time you are writing to a file (or a Writer) instead of reading. Just like for the ConfigurationLoader, you should use an ObjectMapper from the Jackson library, but this time call the ObjectMapper#writeValue method.
Once you are done, make sure the tests pass:
mvn test -Dtest=CrawlResultWriterTest
Hint: If a test fails due to a Stream being closed twice, try calling ObjectMapper#disable(Feature) with the com.fasterxml.jackson.core.JsonGenerator.Feature.AUTO_CLOSE_TARGET feature. This will prevent Jackson from closing the Writer in CrawlResultWriter#write(Writer), since you should have already closed it in CrawlResultWriter#write(Path).
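The relationship between the two write methods and the AUTO_CLOSE_TARGET hint can be sketched roughly as follows. This is a minimal sketch, not the project's implementation: ResultWriterSketch is a hypothetical stand-in for CrawlResultWriter, and a plain Map stands in for the crawl result, whose real class is not shown here. It assumes the Jackson databind library is on the classpath.

```java
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for CrawlResultWriter; a Map plays the role of the result. */
public final class ResultWriterSketch {
  private final Map<String, Object> result;

  public ResultWriterSketch(Map<String, Object> result) {
    this.result = result;
  }

  /** Opens the file, delegates to write(Writer), and closes the writer exactly once. */
  public void write(Path path) throws IOException {
    try (Writer writer = Files.newBufferedWriter(path)) {
      write(writer);
    }
  }

  /** Writes JSON to the given writer and leaves closing it to the caller. */
  public void write(Writer writer) throws IOException {
    ObjectMapper mapper = new ObjectMapper();
    // Without this, Jackson closes the writer itself after writeValue returns.
    mapper.disable(JsonGenerator.Feature.AUTO_CLOSE_TARGET);
    mapper.writeValue(writer, result);
  }

  public static void main(String[] args) throws IOException {
    Map<String, Integer> wordCounts = new LinkedHashMap<>();
    wordCounts.put("foo", 54);
    wordCounts.put("bar", 23);
    wordCounts.put("baz", 14);

    Map<String, Object> result = new LinkedHashMap<>();
    result.put("wordCounts", wordCounts);
    result.put("urlsVisited", 12);

    StringWriter out = new StringWriter();
    new ResultWriterSketch(result).write(out);
    System.out.println(out);
  }
}
```

Keeping ownership of the Writer with whoever opened it means write(Writer) can also be handed a writer that the caller wants to keep using afterwards.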